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A  causal  distributed  breakpoint  is  initiated  by  a  sequential  breakpoint  in  one  process  of 
a  distributed  computation,  and  restores  each  process  in  the  computation  to  its  earliest 
state  that  reflects  all  events  that  ‘‘happened  before"  the  breakpoint.  A  causal  distributed 
breakpoint  is  the  natural  extension  for  distributed  programs  of  the  conventional  notion  of 
a  breakpoint  in  a  sequential  program.  We  present  an  algorithm  for  finding  the  causal  dis¬ 
tributed  breakpoint  given  a  sequential  breakpoint  in  one  of  the  processes.  Approximately 
consistent  checkpoint  sets  are  used  for  efficiently  restoring  each  process  to  its  state  in  a 
causal  distributed  breakpoint. \ 

Causal  distributed  breakpoints  assume  deterministic  processes  that  communicate  solely 
by  messages.  The  dependencies  that  arise  from  communication  between  processes  are 
logged.  Dependency  logging  and  approximately  consistent  checkpoint  sets  have  been  im¬ 
plemented  on  a  network  of  SUN  workstations  running  the  V-System.  Overhead  on  the 
message  passing  primitives  varies  between  between  1  and  14  percent  for  dependency  log¬ 
ging.  Execution  time  overhead  for  a  200  x  200  Gaussian  elimination  is  less  than  4  percent, 
and  generates  a  dependency  log  of  288  kilobytes.  /  ^ 
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1  Introduction 


At  a  breakpoint  the  programmer  would  like  to  see  the  program  in  a  state  that  reflects  all  events 
that  have  had  a  causal  effect  on  the  state  of  the  program  at  the  breakpoint.  With  a  sequential 
program,  this  is  straightforward,  since  only  events  earlier  in  time  have  a  causal  effect  on  events 
later  in  time,  and  sequential  execution  totally  orders  all  events.  In  distributed  programs,  however, 
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breakpoints  are  more  complicated,  since  events  are  only  partially  ordered,  and  since  it  is  impossible 
to  obtain  am  instantaneous  snapshot  of  the  entire  program. 

A  causal  distributed  breakpoint  is  the  natural  extension  for  distributed  programs  of  the  conven¬ 
tional  notion  of  a  breakpoint  in  a  sequential  program.  A  causal  distributed  breakpoint  restores  each 
process  to  the  earliest  state  that  reflects  all  events  that  happened  before  the  breakpoint,  according 
to  Lamport’s  partial  order  of  events  in  a  distributed  system  [8],  In  contrast,  previous  definitions  of 
distributed  breakpoints  simply  find  any  consistent  global  system  state  including  the  breakpoint. 

We  present  an  algorithm  for  finding  the  causal  distributed  breakpoint  given  a  sequential  break¬ 
point  in  some  process.  The  implementation  of  a  causal  distributed  breakpoint  requires  a  facility 
for  restarting  and  replaying  the  execution  of  individual  processes  by  means  of  dependency  logging-. 
Each  message  carries  with  it  an  event  index  for  the  sending  process  at  the  time  the  message  was 
sent.  This  facility  for  restart  and  replay  requires  that  process  execution  be  deterministic. 

The  logging  of  dependency  information  is  essential,  but  the  message  data  may  or  may  not  be 
logged.  Logging  the  data  allows  quick  restoration  of  processes  to  a  particular  state,  but  requires 
much  larger  logs.  The  use  of  approximately  consistent  checkpoint  sets  expedites  restoration  of 
process  events,  permitting  the  use  of  dependency  logging  for  all  messages  except  those  sent  during 
the  time  the  checkpoint  is  being  taken.  This  keeps  the  logs  small. 

We  describe  our  model  of  distributed  computation  in  Section  2  of  this  paper.  We  define  causal 
distributed  breakpoints  and  motivate  their  definition  in  Section  3.  Section  4  contains  the  causal 
distributed  breakpoint  algorithm,  and  Section  5  shows  how  to  restore  the  system  to  a  causa]  dis¬ 
tributed  breakpoint  using  approximately  consistent  checkpoint  sets.  Some  implementation  experi¬ 
ence  is  described  in  Section  6.  In  Section  7  we  survey  related  work,  and  in  Section  8  we  present 
conclusions. 

2  Distributed  System  Model 

We  consider  a  collection  of  deterministic  processes  communicating  solely  via  messages.  The  execu¬ 
tion  of  the  individual  processes  is  deterministic  in  the  following  sense:  If  two  processes  start  in  the 
same  state  and  receive  the  same  sequence  of  input  messages,  they  terminate  in  the  same  state  and 
send  the  same  sequence  of  output  messages.  An  event  is  the  sending  or  the  receipt  of  a  message. 
An  event  is  uniquely  identified  within  the  process  in  which  it  occurs  by  an  event  index ,  which  is 
incremented  by  one  at  each  event. 

Events  are  partially  ordered  by  Lamport’s  “happened  before”  relation,  denoted  — »,  and  de¬ 
fined  [8]: 

1.  If  o  and  6  are  events  in  the  same  process,  and  a  occurs  before  6,  then  a  —*  b. 

2.  If  a  is  the  sending  of  a  message  and  6  is  the  receipt  of  the  same  message,  then  a  — *  b. 

3.  If  a  — *  6  and  6  —  c,  then  a  — »  c. 

Two  events  a  and  6  are  concurrent  if  and  only  if  a  6  and  b  /*  a. 

A  process  state  is  an  observation  of  the  state  of  a  process  at  some  point  in  its  execution.  The 
process  states  of  a  particular  process  are  totally  ordered  with  respect  to  each  other  and  with  respect 
to  all  events  happening  in  that  process.  A  process  state  of  a  particular  process  reflects  an  event 
occurring  in  that  process  if  the  event  happens  before  that  process  state.  A  system  state  is  a  set  of 
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process  states,  one  for  each  process.  A  system  state  is  consistent  if  and  only  if  for  each  process  i,  if 
the  state  of  process  i  in  the  system  state  reflects  event  crt,  then  the  state  of  all  other  processes  must 
reflect  all  events  that  happened  before  event  <?{.  That  is,  all  messages  received  by  all  processes 
must  have  been  sent.  Only  consistent  system  states  can  occur  during  normal  execution. 


3  Causal  Distributed  Breakpoints 

3.1  Definition 

A  causal  distributed  breakpoint  is  initiated  by  the  occurrence  of  a  breakpoint  in  the  breakpoint 
process.  The  breakpoint  event  is  defined  as  the  latest  event  in  the  breakpoint  process  that  happened 
before  the  breakpoint.  A  causal  distributed  breakpoint  is  defined  as  a  system  state  that  consists  of, 

1.  for  the  breakpoint  process,  its  state  at  the  time  of  the  breakpoint,  and 

2.  for  all  other  processes,  the  earliest  process  state  that  reflects  all  events  in  that  process  that 
happened  before  the  breakpoint  event. 


Any  causal  distributed  breakpoint  is  a  consistent  system  state. 

With  a  sequential  breakpoint,  the  state  of  a  sequential  program  reflects  all  events  that  oc¬ 
curred  in  physical  time  before  the  breakpoint.  The  naive  extension  of  this  notion  to  a  distributed 
program — i.e.,  the  state  of  all  processes  such  that  they  reflect  all  events  that  occurred  in  physical 
time  before  the  breakpoint — is  not  achievable,  because  it  is  not  possible  to  take  an  instantaneous 
snapshot  of  the  entire  system.  Nor  is  it  appropriate,  because  events  that  precede  the  breakpoint  in 
physical  time  can  obscure  the  causal  relationship  between  events  in  a  distributed  program.  Causal 
distributed  breakpoints  extend  the  notion  of  a  sequential  breakpoint  so  that  the  system  state  reflects 
all  events  that  happened  before  the  breakpoint  according  to  Lamport’s  “happened  before”  partial 
order,  which  is  an  observable  ordering.  In  Figure  1,  the  execution  of  three  processes  is  shown, 
with  time  going  from  left  to  right.  The  messages  mi  through  m$  have  been  exchanged  between 
the  processes.  A  breakpoint  has  occurred  in  Process  1  at  the  location  marked  “x.”  The  causal 
distributed  breakpoint  for  this  breakpoint  event  consists  of  the  process  states  at  the  intersection 
line  A  with  the  lines  representing  the  execution  of  each  process. 

A  B 


Figure  1  Distributed  Breakpoints  for  Event  3  of  Process  1. 
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3.2  Relation  to  Other  Definitions 

Miller  and  Choi  [10],  and  Haban  and  Weigel  [5]  define  a  distributed  breakpoint  as  any  consistent 
system  state  that  includes  the  breakpoint  state.  In  an  arbitrary  consistent  state,  the  processes 
other  than  the  breakpoint  process  can  be  in  any  state  that 

1.  is  equal  to  or  later  than  the  earliest  state  that  reflects  adl  events  that  happened  before  the 
breakpoint  event,  and 

2.  is  earlier  than  the  first  event  that  the  breakpoint  event  happened  before. 

By  contrast,  in  a  causal  distributed  breakpoint,  all  processes  are  at  the  earliest  state  that  reflects 
all  events  that  happened  before  the  breakpoint  state.  In  Figure  1,  a  consistent  system  state  is  any 
system  state  consisting  of  process  states  between  lines  A  and  B.  The  causal  distributed  breakpoint 
is  the  system  state  indicated  by  line  A. 

The  continued  execution  of  a  process  beyond  its  state  in  the  causal  distributed  breakpoint  may 
obscure  the  causal  relationships  on  which  the  state  of  the  breakpoint  process  depends.  For  instance, 
in  Figure  1,  in  the  consistent  system  state  denoted  by  line  B,  Process  3  has  received  message  m3. 
As  a  result,  the  state  of  Process  3  may  have  changed  so  as  to  destroy  any  evidence  of  why  Process  3 
sent  message  m2,  which  had  a  causal  effect  on  Process  1  at  the  breakpoint  state. 

4  The  Causal  Distributed  Breakpoint  Algorithm 

Each  message  carries  with  it  the  event  index  of  its  sender  at  the  time  the  message  was  sent. 
Dependency  information  is  logged  for  each  message  received,  and  consists  of  the  identification  of 
the  sending  and  receiving  processes  and  the  event  indices  corresponding  to  the  sending  and  the 
receiving  of  the  message. 

The  algorithm  relies  on  the  observation  that  if  event  tj  in  process  j  happened  before  event  Oi 
in  process  i,  then  all  events  from  0  to  t}  -  1  in  process  j  also  happened  before  event  <7,  in  process  i. 
Hence,  it  suffices  to  know  for  each  event  <7*  the  latest  event  of  each  other  process  that  happened 
before  <7 ;.  We  define  the  dependency  vector  [6]  of  event  cr;  of  process  i,  DVf,  as  a  vector 

DVf  =  (6l,62.6-i  ..Jn)  , 

where  n  is  the  total  number  of  processes  in  the  sys  .  Component  of  the  dependency  vector 
of  event  of  process  i  is  always  set  to  <7,.  Component  6:  of  the  dependency  vector  of  event  of 
process  i  is  the  largest  event  index  that  was  received  by  process  i  in  a  message  sent  by  process  j. 
If  process  i  has  not  received  a  message  from  process  j,  then  6j  is  set  to  X,  which  is  less  than  all 
possible  event  indices.  The  dependency  vector  of  any  event  can  be  computed  easily  using  the  log 
for  that  process,  given  that  for  each  message  received  the  sender’s  identification  and  the  send’s 
event  index  are  recorded  in  the  log. 

This  dependency  vector  records  the  direct  dependencies  between  events,  resulting  from  the  fact 
that  send  events  happen  before  the  corresponding  receive  events.  However,  transitive  dependencies 
may  also  exist  between  events  in  two  processes  i  and  j,  as  a  result,  for  instance,  of  a  message  sent 
from  process  i  to  process  k  and  a  message  sent  later  from  process  k  to  process  j.  Such  transitive 
dependencies  must  be  taken  into  account  when  computing  the  causal  distributed  breakpoint. 
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Figure  2  shows  the  algorithm  for  finding  the  causal  distributed  breakpoint  for  event  a  of  pro¬ 
cess  i.  The  algorithm  computes  for  each  process  the  latest  event  that  happened  before  the  break¬ 
point  event.  Procedure  CausalBkpt  starts  by  initializing  the  causal  distributed  breakpoint,  CDB, 
to  contain  the  breakpoint  event  a  for  process  i,  and  -L  for  all  other  processes.  The  algorithm 
then  examines  the  dependency  vector  of  each  event  it  includes  in  CDB  by  recursively  invoking 
VisitEvent.  When  VisitEvent  finds  an  event  index  that  is  larger  than  the  index  of  the  corre¬ 
sponding  process  already  included  in  CDB,  CDB  is  altered  to  include  the  new,  larger  event  index, 
and  the  dependency  vector  of  that  event  is  also  examined.  Thus,  VisitEvent  recursively  includes 
all  events  that  happened  before  the  breakpoint  event.  Hence,  when  the  recursion  terminates,  CDB 
holds  the  event  indices  of  the  causal  distributed  breakpoint. 

Suppose  that  in  Figure  1  we  wish  to  find  the  causal  distributed  breakpoint  for  the  state  marked 
with  an  “x”  in  Process  1.  CDB  is  initialized  to  (2,  J_,  _L),  and  VisitEvent  is  invoked  for  event  2 
of  Process  1.  The  dependency  vector  DVj  for  that  event  is  (2,3, 1).  The  event  index  for  Process  2 
in  this  dependency  vector  is  3,  and  is  not  included  yet  in  CDB.  Thus,  CDB  is  set  to  (2,3,  ±)  and 
VisitEvent  is  invoked  with  event  3  of  Process  2.  The  dependency  vector  DVj  for  that  event  is 
(±,3,2).  The  event  index  for  Process  3  in  this  dependency  vector  is  2  and  is  not  included  yet  in 
CDB.  Thus,  CDB  is  set  to  (2,3,2)  and  VisitEvent  is  invoked  with  event  2  of  Process  3.  The 
dependency  vector  DV§  for  that  event  is  (±,  ±,2).  All  of  its  events  have  been  included  in  CDB; 
hence  this  invocation  of  VisitEvent  returns  without  making  changes  to  CDB,  as  do  both  its 


CausalBkpt  (i  :  process,  a  :  event  index) 
/*  Causal  distributed  breakpoint  for  */ 
/*  CDB  holds  the  result.  */ 
for  all  k  ^  t 
CDB[k]  =  ± 
end  for. 

CDB[i]  =  o 
VisitEvent(  i,  <r  ) 
end  CausalBkpt. 

VisitEvent  (j  :  process,  r  :  event  index) 
/*  Put  the  dependencies  of  t}  into  CDB  */ 
for  all  k  ^  j 
a  =  D  VJ[fc] 
if  a  >  CDB[k] 

CDB[fc]  =  a 
VisitEvent(  fc,  a  ) 
endif 
end  for. 
end  VisitEvent. 


Figure  2  Causal  Distributed  Breakpoint  Algorithm. 
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ancestor  invocations.  The  final  value  of  CDB  is  (2,3,2),  which  is  the  causal  distributed  breakpoint 
for  the  state  marked  with  an  in  Process  1. 

A  dual  of  the  causal  distributed  breakpoint  algorithm  finds  the  state  indicated  by  line  B  in 
Figure  1,  which  we  refer  to  as  the  upper  bound  of  execution  with  respect  to  the  breakpoint  event. 
The  causal  distributed  breakpoint  algorithm  can  be  distributed  by  making  VisitEvent  a  remote 
procedure  call  to  the  process  named  in  the  arguments  and  sending  the  current  value  of  CDB  with 
the  remote  procedure  call. 

In  the  worst  case,  the  complexity  of  the  algorithm  is  linear  in  the  number  of  messages  exchanged 
by  the  computation,  since  each  dependency  may  invoke  transitive  dependencies  on  each  process  for 
which  VisitEvent  has  not  yet  been  invoked.  VisitEvent  is  only  invoked  by  a  direct  dependency, 
so  it  will  never  revisit  the  same  event.  A  lower  bound  for  complexity  is  0(n),  for  a  system  of  n 
processes,  since  each  process  must  be  examined  at  least  once. 

5  Approximately  Consistent  Checkpoint  Sets 

5.1  Motivation 

Recreating  a  causal  distributed  breakpoint  requires  that  each  process  i  be  restored  to  the  process 
state  immediately  after  the  occurrence  of  the  event  with  event  index  CDB[t]. 

If  only  the  dependency  information  is  recorded  in  the  log,  then  every  process  must  be  restarted 
from  its  initial  state.  Since  the  order  of  receipt  of  messages  is  recorded  in  the  log,  all  messages  can 
be  replayed  to  each  process  in  the  same  order  as  in  the  original  execution.  Restoration  is  complete 
when  each  process  t’s  event  index  equals  CDB[i]. 

If  the  message  data  as  well  as  the  dependency  information  are  recorded  in  the  log  for  each 
message  received,  and  processes  take  occasional  checkpoints,  then  it  suffices  to  restart  each  process  i 
to  the  state  recorded  in  its  latest  checkpoint  before  CDB[i],  and  replay  from  the  log  the  messages 
received  since  that  checkpoint  until  the  event  index  equals  CDB[i]. 

Recording  the  message  data  for  each  message  received  may  result  in  impractically  large  logs. 
Recording  only  the  dependency  information  produces  much  smaller  logs,  but  leads  to  potentially 
long  restoration  times.  Approximately  consistent  checkpoint  sets  are  used  to  reduce  the  size  of  the 
logs  while  keeping  restoration  times  short. 

5.2  Method 

A  checkpoint  set  comprises  a  checkpoint  for  each  process  in  the  distributed  computation.  To 
establish  an  approximately  consistent  checkpoint  set,  a  two-phase  protocol  is  used.  The  initiator 
of  a  checkpoint  broadcasts  an  out-of-band  checkpoint  request  to  all  processes  in  the  computation. 
The  checkpoint  request  is  retransmitted  until  an  acknowledgment  is  received  from  all  processes.  A 
monotonically  increasing  checkpoint  identifier  is  transmitted  in  the  checkpoint  request  message  to 
act  as  a  sequence  number  for  duplicate  suppression  among  checkpoint  requests. 

Upon  receiving  a  checkpoint  request,  a  process  performs  a  checkpoint  and  sends  the  event  index 
recorded  in  its  checkpoint  to  the  coordinator  in  its  acknowledgment  of  the  checkpoint  request.  When 
the  initiator  has  received  acknowledgments  from  all  processes  in  the  computation,  it  performs  a 
checkpoint  and  broadcasts  to  all  processes  an  out-of-band  checkpoint  confirmation  message  that 
gives  the  event  index  recorded  for  each  process  in  the  checkpoint  set. 
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In  the  interval  between  its  checkpoint  and  its  receipt  of  the  checkpoint  confirmation  message, 
each  process  that  receives  a  message  records  not  only  the  dependencies,  but  also  the  data  contained 
in  the  message.  After  the  confirmation  message  is  received,  each  process  compares  the  send  event 
index  of  each  received  message  with  the  sender’s  checkpoint  event  index  received  in  the  checkpoint 
confirmation  message.  If  the  event  index  in  the  message  is  smaller,  the  data  in  the  message  are 
recorded  along  with  the  dependency  information.  Otherwise,  only  the  dependency  information  is 
recorded. 

To  recreate  a  particular  process  state,  it  is  never  necessary  to  restart  any  process  from  a  check¬ 
point  earlier  than  its  checkpoint  in  the  most  recent  approximately  consistent  checkpoint  set  that 
precedes  the  state  to  be  restored.  The  data  of  any  message  sent  after  a  checkpoint  in  the  set  can 
be  recreated  by  restarting  from  the  checkpoint.  Because  all  message  dependencies  are  logged,  the 
message  can  be  replayed  to  the  receiver  in  the  order  that  was  recorded  during  the  original  execu¬ 
tion.  Furthermore,  although  the  data  of  any  message  sent  before  the  checkpoint  in  the  set  cannot 
be  recreated  by  restarting  from  the  checkpoint,  its  data  are  available  in  the  receiver’s  log. 

The  checkpoints  in  an  approximately  consistent  checkpoint  set  are  not  necessarily  consistent, 
because  a  process  t  can  complete  its  checkpoint  and  then  send  a  message  to  process  j,  which  may 
be  received  before  process  j  completes  its  checkpoint.  The  receipt  of  the  message  is  recorded  in 
j's  checkpoint,  but  its  sending  is  not  recorded  in  i's  checkpoint.  Although  the  set  of  checkpoints 
is  inconsistent,  process  t  can  deterministically  execute  forward  from  its  checkpoint  to  the  sending 
of  the  message.  Hence,  the  term  approximately  consistent  checkpoint  set:  The  states  recorded  in 
the  checkpoint  set  may  not  be  a  consistent  system  state,  but  they  are  a  system  state  from  which  a 
consistent  state  can  be  recovered. 

If  the  time  between  the  completion  of  a  checkpoint  and  the  receipt  of  the  checkpoint  confirmation 
message  is  short  compared  to  the  interval  between  checkpoints,  relatively  few  messages  need  to 
have  their  data  logged,  resulting  in  much  smaller  logs.  However,  restoration  times  should  also  be 
short,  since  it  suffices  to  restart  each  process  from  its  checkpoint  in  the  approximately  consistent 
checkpoint  set  preceding  the  state  to  be  restored.  To  further  reduce  the  amount  of  storage  required 
to  support  causal  distributed  breakpoint,  any  checkpoint  set  except  the  initial  state  can  safely 
be  deleted,  without  impairing  the  ability  to  restore  any  event,  albeit  at  the  expense  of  slower 
restoration  times  for  some  system  states. 

6  Implementation  Experience 

6.1  Implementation 

Causal  distributed  breakpoint  has  been  implemented  under  the  V-System  [2].  Each  participating 
host  runs  a  modified  V-System  kernel,  a  logging  server  process  managing  the  logging  of  messages 
received  by  the  host,  and  a  checkpoint  server  process  managing  checkpoint  recording  for  the  host. 

As  messages  are  received  by  a  process,  they  are  saved  in  the  message  buffer  by  the  kernel.  This 
buffer  is  stored  in  the  volatile  memory  of  the  local  logging  server  process,  which  periodically  writes 
the  contents  of  this  buffer  to  the  message  log  file  on  the  shared  network  storage  server.  There  is  a 
separate  message  log  file  for  each  host. 

The  checkpoint  server  implements  full  and  incremental  process  checkpoints.  A  full  checkpoint 
writes  the  entire  address  space  to  the  checkpoint  file,  and  is  used  for  the  first  checkpoint  of  each 
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Table  1 

Performance  of  common  V-System  communication  primitives  with  and  without  logging 


Elapsed  Time  (milliseconds) 

Operation 

With 

Dependencies 

Without 

Data 

Only 

Logging 

Send-Receive-Reply 

1.63 

1.57 

1.37 

Send(lK)-Receive-Reply 

3.31 

2.94 

2.73 

MoveTo(64K) 

89.0 

88.7 

87.8 

process.  Thereafter,  an  incremental  checkpoint  is  used  to  write  only  the  pages  of  the  address  that 
have  been  modified  since  the  last  checkpoint.  Precopying  is  used  to  limit  the  effect  of  checkpointing 
on  execution  [12].  In  precopying,  the  process  is  allowed  to  continue  executing  while  initial  passes 
over  the  address  space  are  made  to  write  the  necessary  pages  to  disk.  The  process  is  then  “frozen,” 
and  the  pages  that  were  changed  since  having  been  written  in  the  last  precopying  pass  are  rewritten 
to  disk.  As  a  result,  most  of  the  checkpointing  activity  occurs  concurrently  with  process  execution. 

The  resulting  implementation  is  efficient,  yet  kernel  modifications  are  minor.  The  most 
performance-critical  operation,  recording  dependencies  in  the  volatile  log,  is  the  only  one  per¬ 
formed  entirely  in  the  kernel.  In  addition  to  several  small  changes  to  the  internal  operation  of 
some  existing  kernel  primitives,  only  four  new  primitives  to  support  checkpointing  and  five  new 
primitives  to  support  logging  were  added  to  the  kernel. 

6.2  Performance 

We  measured  performance  on  a  group  of  diskless  SUN-3/60  workstation,  which  have  20- megahertz 
Motorola  MC68020  processors.  They  are  connected  by  a  10  megabit  per  second  Ethernet  to  a  V- 
System  file  server  running  on  a  SUN-3/160,  with  a  16-megahertz  MC68020  processor  and  a  Fujitsu 
Eagle  disk. 

The  overhead  of  message  logging,  with  and  without  logging  message  data,  for  common  V-System 
communication  operations  is  given  in  Table  1  The  performance  is  shown  for  a  Send-Receive-Reply 
of  a  32-byte  message,  Send-Receive-Reply  of  a  32-byte  message  with  a  1-kilobyte  appended  data 
segment,  and  for  a  MoveTo  bulk  data  transfer  operation  of  64  kilobytes.  All  times  were  measured 
at  the  user-level,  and  show  the  elapsed  time  between  invoking  the  operation  and  its  completion. 
The  difference  between  no  logging  and  logging  the  dependencies  for  Send-Receive-Reply  (with  and 
without  appended  data)  is  approximately  200  microseconds,  indicating  a  fixed  cost  of  approximately 
100  microseconds  per  logging  operation.  The  additional  overhead  for  logging  the  message  data 
stems  from  copying  the  message  (and  its  appended  segment,  if  any)  in  the  log.  The  fact  that  the 
differences  between  the  three  numbers  for  MoveTo  are  so  small  is  because  these  bulk  data  transfers 
are  implemented  as  blasts  of  packets  without  intervening  acknowledgments  [13].  As  a  result,  logging 
occurs  in  parallel  with  the  transmission  and  reception  of  the  next  packet,  so  logging  has  almost  no 
effect  on  the  elapsed  time  of  the  operation. 

The  overhead  of  checkpointing  in  terms  of  elapsed  time  is  on  the  order  of  3  seconds  per  megabyte 
of  address  space.  Since  most  of  this  overhead  occurs  in  parallel  with  program  execution,  its  effect 
on  program  execution  time  is  small. 
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Table  2 

Performance  of  200  x  200  Gaussian  elimination 


Without 

With 

Dependencies 

Approximate 

Logging 

Data 

Only 

Checkpoints 

Time  (seconds) 

86.7 

87.7 

87.3 

89.7 

Log  (kilobytes) 

- 

2225 

238 

288 

We  ran  a  program  performing  Gaussian  elimination  with  partial  pivoting  on  a  given  n  x  n 
matrix  of  floating  point  numbers,  with  no  logging,  with  logging  only  the  dependency  information, 
and  with  logging  both  the  dependency  information  and  the  message  data.  This  program  employs 
a  relatively  large  amount  of  communication,  0(n 2)  for  an  n  x  n  matrix. 

The  results  of  this  experiment  are  shown  in  Table  2.  With  six  slave  processors  computing  on 
a  200  x  200  matrix,  the  execution  time  increased  from  86.7  seconds  without  any  logging,  to  87.3 
with  logging  dependencies  (a  0.7  percent  increase),  and  87.7  with  logging  both  message  data  and 
dependency  information  (a  total  1.2  percent  increase).  The  total  size  of  the  logs  for  all  processes 
was  approximately  240  kilobytes  without  message  data,  and  2.2  megabytes  with  message  data. 
With  approximately  consistent  checkpointing  every  30  seconds,  the  overall  execution  time  is  in¬ 
creased  by  roughly  4  percent.  The  effect  of  approximately  consistent  checkpointing  on  the  size  of 
dependency  logs  is  run-dependent,  since  the  checkpoints  are  initiated  asynchronously  with  respect 
to  the  computation,  and  both  message  frequency  and  message  size  during  the  checkpoint  interval 
are  factors  in  the  size  of  the  logs.  In  the  Gauss  experiments,  the  increase  in  log  size  due  to  data 
logging  during  checkpoint  intervals  is  50  kilobytes  on  the  average. 

7  Related  Work 

Miller  and  Choi  [10]  adapt  Chandy  and  Lamport’s  distributed  snapshots  [1]  to  distributed  break- 
pointing.  Haban  and  Weigel  [5]  use  a  similar  approach  to  that  of  Miller  and  Choi  for  generating 
interactive  breakpoints.  Cooper  [4]  broadcasts  a  halt  request  out-of-band,  which  produces  a  consis¬ 
tent  system  state  when  used  during  normal  execution,  or  in  replay  from  logs  containing  dependencies 
only.  However,  none  of  these  methods  gives  the  earliest  system  state  that  reflects  all  events  that 
happened  before  the  breakpoint  event,  as  does  causal  distributed  breakpoint.  In  fairness  to  the 
work  of  Miller  and  Choi,  and  that  of  Haban  and  Weigel,  an  implementation  using  their  definitions 
of  distributed  breakpoint  does  not  require  that  process  execution  be  deterministic,  as  ours  does. 

In  a  later  paper,  Choi  et  al.  develop  the  notion  of  a  before  and  an  after  state  in  a  parallel 
program,  relative  to  a  node  in  a  parallel  dynamic  program  dependency  graph  [3].  Their  before 
state  resembles  our  notion  of  causal  distributed  breakpoint.  However,  they  do  not  address  the  issue 
of  how  to  construct  the  dynamic  dependency  graph,  which  we  do  by  including  the  event  index  of 
the  sender  in  each  message.  Furthermore,  their  algorithm  computes  the  before  state  for  all  nodes  in 
the  graph,  while  ours  computes  the  causal  distributed  breakpoint  for  a  single  process  state.  Their 
paper  reports  no  implementation  results. 

LeBlanc  and  Mellor-Crummey  observe  in  Instant  Replay  that  the  size  of  the  logs  necessary  to 
replay  may  be  a  concern  [9].  They  point  out  that  it  is  not  necessary  to  store  the  message  data  if  all 
message  communication  can  be  described  as  serializable  reads  and  writes  of  shared  objects.  Our 
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method  of  logging  only  dependency  information  is  similar  to  theirs,  although  not  restricted  to  any 
particular  form  of  read/write  access.  Our  use  of  approximately  consistent  checkpoint  sets  improves 
replay  response  time. 

Causal  dependency  tracking  by  including  the  event  index  of  the  sender  in  each  message,  and 
the  notion  of  a  dependency  vector  are  due  to  the  work  of  Johnson  and  Zwaenepoel  on  optimistic 
message  logging  for  fault  tolerance  [6].  The  goal  of  replay  in  optimistic  fault  tolerance  is  to  achieve  a 
maximum  recoverable  system  state.  As  a  result,  they  use  process  state  interval  indices,  incremented 
only  upon  receipt  of  a  message,  instead  of  our  event  indices,  which  are  incremented  for  each  event. 
Their  checkpoint  and  recovery  method  requires  that  they  log  both  dependencies  and  message  data. 
An  alternative  method  of  dependency  tracking  by  including  the  full  dependency  vector  with  each 
message  sent  was  introduced  earlier  by  Strom  and  Yemini  [11]. 

The  approximately  consistent  checkpoint  set  bears  some  similarity  to  the  Chandy-Lamport 
snapshot  algorithm  [1]  and  the  consistent  checkpoints  of  Koo  and  Toueg  [7],  Chandy  and  Lamport 
assume  reliable  communication  channels,  where  we  do  not.  We  assume  deterministic  process  ex¬ 
ecution,  which  they  do  not.  We  use  these  checkpoints  for  replay,  while  they  use  the  checkpoints 
and  the  channel  states  for  detecting  stable  properties  of  computations.  As  a  result  ,  in  the  snapshot 
algorithm,  messages  sent  by  a  process  after  its  checkpoint  must  not  be  received  before  the  check¬ 
point  of  the  recipient.  For  replay  of  deterministic  processes,  this  is  allowable  if  the  original  order 
of  message  receipt  is  reproduced.  Koo  and  Toueg  generate  consistent  checkpoints,  at  the  expense 
of  prohibiting  message  traffic  during  the  execution  of  the  checkpoint  algorithm.  Our  algorithm 
does  not  prohibit  message  traffic  at  any  time,  but  requires  logging  of  the  dependency  information 
and  some  message  data,  and  assumes  deterministic  processes.  Our  algorithm  does  not  produce 
consistent  checkpoints,  but  achieves  a  consistent  system  state  through  re-execution. 

8  Conclusion 

We  have  defined  causal  distributed  breakpoint,  which  naturally  extends  the  notion  of  a  sequential 
breakpoint  to  distributed  systems,  where  only  a  partial  ordering  between  events  can  be  observed. 
When  one  process  of  a  distributed  computation  is  halted  at  a  breakpoint,  the  causal  distributed 
breakpoint  shows  the  other  processes  in  the  computation  in  their  earliest  state  that  reflects  all 
events  that  happened  before  the  breakpoint.  Previous  distributed  breakpoint  definitions  allow  any 
consistent  system  state  including  the  breakpoint.  This  may  obscure  the  causal  relationship  between 
the  states  of  other  processes  and  the  state  of  the  breakpoint  process. 

We  have  also  introduced  the  notion  of  approximately  consistent  checkpoint  sets,  which  allow 
quick  restoration  of  process  states  with  relatively  small  logs.  Our  implementation  shows  that  the 
overhead  of  dependency  logging  has  little  effect  on  program  execution  time,  and  that  log  sizes  using 
approximately  consistent  checkpoint  sets  are  acceptable. 
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